Building the happiness demo was the first time I tried to integrate the new bokeh server (released in v0.11) with django. Having developed a lot of django web applications, I wanted to prove out a number of uses that I thought were important to demonstrate with bokeh server.
Along the way I learned some things about how the bokeh server works, learned about different choices I could make in the implementation, and also learned some things that we could improve in the server!
Given that everyone's use case is different, I thought it would be useful to document this learning.
With the happiness demo, I wanted to build a system that:
The first implementation of happiness is tagged with happiness_v1
- https://github.com/bokeh/bokeh-demos/tree/happiness_v1/happiness
It's worth noting that in this first implementation I didn't actually have users log in as this makes exploring the app take more time. By clicking on a user in the left menu we were treating that like a login - but the exact same setup would have worked had logins been used.
This implementation used:
- Four plot scripts (individual, individuals, team, teams)
- user_pk_source that was used to pass the user_id from django to bokeh
- pull_session which caused bokeh to generate a new session
- periodic_callback
pull_session with no session_id to have bokeh generate one for you
The following code asks bokeh server to generate a unique session for the plot script that's in individual.py:
bokeh_session = pull_session(session_id=None, url='http://localhost:5006/individual/')
To make my plots available I launched bokeh server with
bokeh serve viz/individual.py viz/individuals.py
This resulted in plots being available at http://localhost:5006/individual/ and http://localhost:5006/individuals/.
There is no way to change those urls; they are based on the file name, although you can add a prefix with the --prefix option passed to bokeh serve.
In addition, if you use the bokeh app folder structure http://bokeh.pydata.org/en/0.11.0/docs/user_guide/server.html#directory-format then you can only have one plot in each directory.
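As a rough illustration of the naming rule described above, here is a minimal sketch that derives the expected URL from a script path. The helper name app_url and its parameters are hypothetical and not part of bokeh; it only mirrors the behaviour described here (file name becomes the route, with an optional prefix in front).

```python
import os


def app_url(script_path, host="localhost:5006", prefix=""):
    """Return the URL a single-file app is expected to be served at."""
    # The route is the file name without its directory or .py extension.
    name = os.path.splitext(os.path.basename(script_path))[0]
    # An optional prefix (as given to --prefix) goes in front of the name.
    parts = [part for part in (prefix.strip("/"), name) if part]
    return "http://%s/%s/" % (host, "/".join(parts))


print(app_url("viz/individual.py"))
print(app_url("viz/teams.py", prefix="happiness"))
```

Keeping this in one place on the django side means a renamed script only needs the url changed once.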
While something like django's template engine is very forgiving, bokeh server isn't. Users aren't going to get any helpful message if the plot fails to load, and django isn't going to know that the plot is going to fail to load. As a result it's important to make sure all inputs are available for bokeh and appropriate fallbacks are available.
autoload_server has a lot of options
There are a number of examples of using autoload_server in the bokeh examples, but there are actually a lot of different ways to use it, and it's worth checking out the docs: http://bokeh.pydata.org/en/0.11.0/docs/reference/embed.html#bokeh.embed.autoload_server
In this implementation I had to set up a ColumnDataSource called user_pk_source to pass the user_id (or object.pk) to bokeh. The data of a ColumnDataSource is always a dictionary of lists. This meant passing the user_id in the form
{
'user_pk': [user_id]
}
The code that did this is:
user_source = bokeh_session.document.get_model_by_name('user_pk_source')
user_source.data = dict(user_pk=[self.object.pk])
This was pretty clunky. What would have been nicer is some kind of custom session attribute, so we could do
# Warning - does not currently work
bokeh_session.user_pk = self.object.pk
There is an open issue https://github.com/bokeh/bokeh/issues/3349 to add this feature.
It's worth noting that the method document.get_model_by_name was useful in making this relatively concise and transparent. For this to work, though, you must add the source to the document so that django can access it:
document.add_root(plot)
document.add_root(user_pk_source)
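To make the dictionary-of-lists shape used for user_pk_source concrete, here is a small sketch using plain dictionaries as a stand-in for the data of a ColumnDataSource. The helpers to_column_data and read_scalar are hypothetical names introduced for illustration; the fallback on a missing or empty column is an assumption about how the plot side might guard its inputs.

```python
def to_column_data(**values):
    """Wrap each scalar in a one-element list (the dictionary-of-lists shape)."""
    return {key: [value] for key, value in values.items()}


def read_scalar(data, key, default=None):
    """Read the scalar back, returning default if the column is missing or empty."""
    column = data.get(key) or []
    return column[0] if column else default


data = to_column_data(user_pk=42)   # what django would assign to the source
user_pk = read_scalar(data, "user_pk")  # what the plot script would read back
print(data, user_pk)
```

Reading through a helper with a default gives the plot script a chance to fall back gracefully when django has not filled in the source yet.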
As noted above, I needed to add the user_pk_source to the Document so that I could access it with the document.get_model_by_name method. However, it turned out that the order in which you add things to the Document is very important. The plot had to be added first to ensure that it rendered correctly. There is an open pull request fixing this constraint: https://github.com/bokeh/bokeh/pull/3624
To allow bokeh to use the Django ORM, I needed to start the bokeh server with the django settings available. The bokeh server was started with the command
PYTHONPATH=$PWD DJANGO_SETTINGS_MODULE=webapp.settings bokeh serve viz/individual.py viz/individuals.py viz/team.py viz/teams.py --log-level=info --host=localhost:5006 --host=localhost:8001
In addition, I needed bokeh to make a call to django.setup() once. In each plot declaration I had the following code:
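import django
import viz.utils

# Set the flag on the module itself: rebinding a name imported with
# "from viz.utils import django_setup" would not be seen by other scripts.
if not viz.utils.django_setup:
    django.setup()
    viz.utils.django_setup = True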
The following command was used to start bokeh server. In it we give the django server (which is running on port 8001) the ability to access bokeh server, as well as giving the bokeh server permission to access itself.
PYTHONPATH=$PWD DJANGO_SETTINGS_MODULE=webapp.settings bokeh serve viz/individual.py viz/individuals.py viz/team.py viz/teams.py --log-level=info --host=localhost:5006 --host=localhost:8001
The periodic callback in bokeh was set to 5 seconds so that the database wasn't being hit too often. To set up the periodic callback I did
document.add_periodic_callback(update_data, 5000)
However, this delayed loading of the initial plot. To compensate for this I added the following
def update_data_once():
    update_data()

document.add_timeout_callback(update_data_once, 250)
add_timeout_callback adds a callback to be invoked once, after a given amount of time - 250ms in this case.
What I wanted to do was just:
# Warning - does not currently work
document.add_timeout_callback(update_data, 250)
document.add_periodic_callback(update_data, 5000)
Unfortunately this doesn't work at the moment, and bokeh kicks up an error because we've tried to add the same callback method twice. As a result I needed to add the dummy method update_data_once.
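The dummy-method workaround can be generalised with a tiny wrapper that returns a fresh function object each time. The sketch below uses a hypothetical CallbackRegistry class that only mimics the "same callback added twice" check described above; it is not bokeh's implementation, and the helper name distinct is made up for illustration.

```python
class CallbackRegistry(object):
    """Stand-in for a document that refuses to register the same callable twice."""

    def __init__(self):
        self.callbacks = []

    def add(self, callback):
        if callback in self.callbacks:
            raise ValueError("callback already added")
        self.callbacks.append(callback)


def distinct(callback):
    """Return a new function object that simply calls callback."""
    def wrapper():
        return callback()
    return wrapper


calls = []


def update_data():
    calls.append("updated")


registry = CallbackRegistry()
registry.add(update_data)            # plays the role of the periodic callback
registry.add(distinct(update_data))  # plays the role of the one-off callback

for callback in registry.callbacks:
    callback()
```

Both registrations end up calling the same update_data, but the registry sees two different callables, which is the effect update_data_once achieves by hand.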
In this demo, plots have legends that change based on the user's data. It turns out that the legend layout calculation is done only once. This means that if you're not careful, you can end up with your legend being laid out poorly, with text overlapping the legend glyph. The main change I made, which seemed to prevent this, was giving a modest delay to add_timeout_callback. I didn't dig into this a lot; I just got something working.
Had I not scrapped this implementation, I would have made some improvements.
The code that populated the django session was:
bokeh_session = pull_session(session_id=None, url='http://localhost:5006/%s/' % suffix)
user_source = bokeh_session.document.get_model_by_name('user_pk_source')
user_source.data = dict(user_pk=[self.object.pk])
script = autoload_server(None, app_path='/%s' % suffix, session_id=bokeh_session.id)
This was leaving bokeh sessions open. I should have cleaned up after myself by calling bokeh_session.close() as soon as I had finished changing the data:
bokeh_session = pull_session(session_id=None, url='http://localhost:5006/%s/' % suffix)
user_source = bokeh_session.document.get_model_by_name('user_pk_source')
user_source.data = dict(user_pk=[self.object.pk])
bokeh_session.close()
script = autoload_server(None, app_path='/%s' % suffix, session_id=bokeh_session.id)
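One way to make sure the session is closed even if changing the data raises an error is contextlib.closing from the standard library. This is only a sketch: FakeSession is a hypothetical stand-in that models just the id attribute and close() method used above, and configure_and_close is a made-up helper name.

```python
from contextlib import closing


class FakeSession(object):
    """Stand-in for the object returned by pull_session (only id and close())."""

    def __init__(self, session_id):
        self.id = session_id
        self.closed = False

    def close(self):
        self.closed = True


def configure_and_close(session, configure):
    """Run configure(session), always close the session, and return its id."""
    with closing(session):
        configure(session)
    return session.id


session = FakeSession("abc123")
session_id = configure_and_close(session, lambda s: None)
print(session_id, session.closed)
```

The returned id is all that is needed afterwards to build the embed script, so nothing has to hold on to the open session.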
Bokeh does not currently have a way to send partial updates of a data source. That means that if the data source is changed, the whole data goes down the web socket again. My data was being updated on every periodic callback (every 5 seconds). This wastes bandwidth and I'm not sure what it would have done to performance on a slow connection.
Partial patching of data sources is in the works for bokeh, but in the meantime I could have checked whether the data had changed and updated the data source only if it had.
There may have been other optimizations - such as only getting the data for the timeframe the user was looking at.
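The change check mentioned above could look something like the following sketch. It assumes the columns are plain Python lists, so that dictionaries can be compared with ==, and it uses a SimpleNamespace as a stand-in for a data source with a data attribute; update_if_changed is a hypothetical helper name.

```python
from types import SimpleNamespace


def update_if_changed(source, new_data):
    """Assign new_data to source.data only when it differs; report whether it did."""
    if source.data == new_data:
        return False
    source.data = new_data
    return True


# Stand-in for a data source holding two columns.
source = SimpleNamespace(data={"date": [1, 2], "happiness": [3, 4]})

unchanged = update_if_changed(source, {"date": [1, 2], "happiness": [3, 4]})
changed = update_if_changed(source, {"date": [1, 2, 3], "happiness": [3, 4, 5]})
print(unchanged, changed)
```

With a check like this inside the periodic callback, the database is still queried every 5 seconds, but data is only sent down the web socket when something is actually new.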
Bokeh server comes with a number of features that allow you to ensure that sessions are only made when you want them to be. These are tucked away in the docs here: http://bokeh.pydata.org/en/0.11.0/docs/user_guide/cli.html#session-id-options
If a user who was not logged in tried to hit my bokeh server directly, they would only be served a blank plot, which is probably fine, but using these options would have improved security further. Given that bokeh already had access to django's settings, they probably could have shared the django secret key.
Here's some more context from a key author of the new bokeh server:
The purpose of the signed session ID is to keep someone from connecting to the bokeh server directly (without django app) and getting a session. As you say, if the sessions are empty anyway until your Django app fills them in, it's harmless (other than resource usage) for people to connect, so you wouldn't have to use external-signed mode. But say for example your sessions did have some type of proprietary information (perhaps not per-user info, just info that should only be accessible to someone who's logged in or someone who's a member of a certain group), and that info was in the session by default - then you might want external-signed so that Django could control access to the bokeh server.
I would think using external-signed is a good default practice in this sort of app, since it's more locked down and it presumably isn't useful to visit the bokeh server directly.
source: https://github.com/bokeh/bokeh-demos/pull/13#issuecomment-169482037
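To illustrate the general idea behind a signed session id, here is a sketch using HMAC from the standard library. The id format, the helper names, and the key are all assumptions made for illustration; this is not the format bokeh itself uses, only the principle that a server holding the shared secret can reject ids it did not issue.

```python
import hashlib
import hmac
import secrets


def sign_session_id(secret_key, session_id=None):
    """Return 'id-signature', generating a random id if none is given."""
    session_id = session_id or secrets.token_hex(16)
    signature = hmac.new(secret_key, session_id.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return "%s-%s" % (session_id, signature)


def is_valid(secret_key, signed_id):
    """Check that the signature part matches the id part."""
    session_id, _, signature = signed_id.rpartition("-")
    expected = hmac.new(secret_key, session_id.encode("utf-8"),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)


key = b"shared-secret"  # hypothetical; would be shared by django and bokeh
signed = sign_session_id(key)
print(is_valid(key, signed))
```

Only a party that knows the key can produce an id that passes the check, which is what lets django decide who gets a session.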
If I were putting this into production, I would probably have had the bokeh plot fall back to a text glyph saying "i'm sorry there was a problem loading your plot" or something similar, to give a better user experience if the plot failed to load.
The big problem that I saw with this implementation was that every session was hitting the database every 5 seconds. Although this was fine at small scale, it seemed like a waste of resources and unlikely to scale well. In addition, knowing that django has save signals such as post_save, meaning I could take an action when data was updated, I wanted to do better.
After talking with some people I learned some more fundamentals about how bokeh server works and that there are better ways of setting this particular example up.
I had been thinking of a Session as an instance of a Document. It's not. A Session is just a container that can hold a Document. It can hold any Document and you don't have to declare that Document in the bokeh app code! This may be obvious to some, but was a bit of a revelation to me. I had only seen examples that used the Document or curdoc and then served up that file with bokeh serve, and I hadn't connected all the dots.
What connected the dots was Havoc highlighting the difference between pull_session and push_session. Both create a session. But pull_session gives you the state from the bokeh server, and in push_session you give the state to bokeh server. The state is an instance of Document. The key is then deciding which side is authoritative. If bokeh server is authoritative, then you would want to use pull_session to get the session from bokeh and then modify it how you need to. On the other hand, if django is authoritative, then using push_session you can just build the Document in django with the data from django and push it to the server.
The main problem I wanted to fix was to only have data updated when there was new data. From a django perspective, this means bokeh only updating when a save event occurs, which can be done with the post_save signal. So the idea is:
- post_save signal
There are a number of ways of doing this. In my second implementation I am going for the simplest, but there are other ways, and these are discussed at the end of this notebook.
Although I then re-wrote the example, it's worth noting that I can imagine scenarios where I would still use a setup like this. With the bokeh server still new, we don't yet have enough real world experience of what works well and not for scalable web-facing (not on a private network) applications. Please do share your experiences on the bokeh mailing list: https://groups.google.com/a/continuum.io/forum/#!forum/bokeh
In my second implementation, bokeh-server will be very naive and django will do all the work.
Django will push all the data to bokeh sessions, and when there is new data django will push it to bokeh server. Bokeh server will then just focus on its job of updating the attached clients and pushing that new data down to them.
This cleans up a number of things, but does mean that in django I will need to keep track of bokeh sessions.
Given that django already has sessions, I can just add the bokeh session ids to the user's session and they'll be readily available for me to update.
It's worth noting that there are two options for updating a bokeh session and which one you pick depends on your use case. You can either store the session_id or keep the Session object. If you keep the Session object, you'll be keeping the sessions open between django and bokeh. This has the advantage that you don't need to download the document again when you want to update the data, but the disadvantage that you'll be keeping all the sessions open. If you just store the session_id then you'll need to use pull_session to download the Document from bokeh server and then update the data.
Given that in the happiness demo I don't expect data updates to be too frequent, I'm going with the second option of storing the session_id and not keeping the session open. I'm not sure this is the correct choice - please do share your experiences.
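Storing only the session_id could be as simple as keeping a mapping in the user's django session. The sketch below uses a plain dict as a stand-in for request.session (which is dict-like); the key name bokeh_session_ids and both helper functions are hypothetical.

```python
def remember_bokeh_session(django_session, plot_name, bokeh_session_id):
    """Record the bokeh session id for a plot in a dict-like django session."""
    ids = dict(django_session.get("bokeh_session_ids", {}))
    ids[plot_name] = bokeh_session_id
    # Reassign the key so a session backend can notice the change.
    django_session["bokeh_session_ids"] = ids


def bokeh_session_ids(django_session):
    """Return the stored plot name -> session id mapping (empty if none)."""
    return dict(django_session.get("bokeh_session_ids", {}))


session = {}  # stands in for request.session
remember_bokeh_session(session, "individual", "abc123")
remember_bokeh_session(session, "team", "def456")
print(bokeh_session_ids(session))
```

When new data is saved, the stored ids are what would be handed to pull_session to fetch each Document and update it.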
In doing it this way, my bokeh server is a completely empty implementation. There is no need for the four instances of individual, individuals, team, teams. All the bokeh code will live under my django views and django will push the plots to bokeh. This should make everything cleaner and easier to follow what's going on as the code isn't in two places (under bokeh or django). It also means that I don't need to restart the bokeh server any time I want to make a change to a bokeh plot.